Abstract

Suppose that we are given a set of videos, along with natural
language descriptions in the form of multiple sentences
(e.g., manual annotations, movie scripts, sport summaries 
etc.), and that these sentences appear in the same temporal 
order as their visual counterparts. We propose in this paper 
a method for aligning the two modalities, i.e., automatically 
providing a time (frame) stamp for every sentence. Given 
vectorial features for both video and text, this can be cast 
as a temporal assignment problem, with an implicit linear 
mapping between the two feature modalities. We formulate 
this problem as an integer quadratic program, and solve its 
continuous convex relaxation using an efficient conditional 
gradient algorithm. Several rounding procedures are proposed
to construct the final integer solution. After demonstrating
significant improvements over the state of the art on
the related task of aligning video with symbolic labels [7],
we evaluate our method on a challenging dataset of videos
with associated textual descriptions [37], and explore
bag-of-words and continuous representations for text.


1. Introduction 

Fully supervised approaches to action categorization 
have shown good performance in short video clips [46]. 
However, when the goal is not only to classify a clip where 
a single action happens, but to compute the temporal extent 
of an action in a long video where multiple activities may 
take place, new difficulties arise. In fact, the task of identifying
short clips where a single action occurs is at least as
difficult as classifying the corresponding action afterwards.
This is reminiscent of the gap in difficulty between categorization
and detection in still images. In addition, as noted
in [7], manual annotations are very expensive to get, even
more so when working with a long video clip or a film
shot, where many actions can occur. Finally, as mentioned
in [13, 41], it is difficult to define exactly when an action
occurs. This makes the task of understanding human activities
much more difficult than finding objects or people in
images.

Figure 1: An example of video to natural text alignment using our
method on the TACoS [37] dataset.

In this paper, we propose to learn models of video content
with minimal manual intervention, using natural language
sentences as a weak form of supervision. This has
the additional advantage of replacing purely symbolic and
essentially meaningless hand-picked action labels with a
semantic representation. Given vectorial features for both
video and text, we address the problem of temporally aligning
the video frames and the sentences, assuming the order
is preserved, with an implicit linear mapping between the
two feature modalities (Fig. 1). We formulate this problem
as an integer quadratic program, and solve its continuous
convex relaxation using an efficient conditional gradient
algorithm.


Related work. Many attempts at automatic image captioning
have been proposed over the last decade: Duygulu et
al. [9] were among the first to attack this problem; they proposed
to frame image recognition as machine translation.
These ideas were further developed in [3]. A second important
line of work has built simple natural language models as
conditional random fields of a fixed size [10]. Typically this
corresponds to fixed language templates such as: (Object,
Action, Scene). Much of the work on joint representations
of text and images makes use of canonical correlation analysis
(CCA) [19]. This approach was first used to perform
image retrieval based on text queries by Hardoon et
al. [17], who learn a kernelized version of CCA to rank images
given text. It has been extended to semi-supervised
scenarios [42], as well as to the multi-view setting [14]. All
these methods frame the problem of image captioning as a
retrieval task [18, 33]. Recently, there has also been a significant
amount of work on joint models for images and text
using deep learning (e.g. [12, 23, 28, 43]).

There has been much less work on joint representations
for text and video. A dataset of cooking videos with associated
textual descriptions is used to learn joint representations
of those two modalities in [37]. The problem of
video description is framed as a machine translation problem
in [38], while a deep model for descriptions is proposed
in [8]. Recently, a joint model of text, video and speech has
also been proposed [29]. Textual data, such as scripts, have
been used for automatic video understanding, for example
for action recognition [26, 31]. Subtitles and scripts have
also often been used to guide person recognition models
(e.g. [6, 36, 44]).

The temporal structure of videos and scripts has been 
used in several papers. In [7], an action label is associated 
with every temporal interval of the video while respecting 
the order given by some annotations (see [36] for related 
work). The problem of aligning a large text corpus with 
video is addressed in [45]. The authors propose to match 
a book with its television adaptation by solving an alignment
problem. This problem is however very different from
ours, since the alignment is based only on character identities.
The temporal ordering of actions, e.g., in the form
of Markov models or action grammars, has been used to
constrain action prediction in videos [25, 27, 39]. Spatial
and temporal constraints have also been used in the context
of group activity recognition [1, 24]. Similarly to our
work, [47] uses a quadratic objective under time warping
constraints. However, it does not provide a convex relaxation,
and proposes an alternate optimization method instead.
Time warping problems under constraints have been
studied in other vision tasks, especially to address the challenges
of large scale data [35].

The model we propose in this work is based on discriminative
clustering, a weakly supervised framework for partitioning
data. Contrary to standard clustering techniques,
it uses a discriminative cost function [2, 16] and it has
been used in image co-segmentation [20, 21], object
co-localization [22], person identification in video [6, 36], and
alignment of labels to videos [7]. Contrary to [ ], for example,
our work makes use of continuous text representations.
Vectorial models for words are very convenient when
working with heterogeneous data sources. Simple sentence
representations such as bags of words are still frequently
used [14]. More complex word and sentence representations
can also be considered. Simple models trained on
a huge corpus [32] have demonstrated their ability to encode
useful information. It is also possible to use different
embeddings, such as the posterior distribution over latent
classes given by a hidden Markov model trained on the
text [15].

Figure 2: Illustration of some of the notations used in this paper.
The video features $\Phi$ are mapped to the same space as text features
using the map W. The temporal alignment of video and text features
is encoded by the assignment matrix Y. Light blue entries in
Y are zeros, dark blue entries are ones. See text for more details.

1.1. Problem statement and approach 

Notation. Let us assume that we are given a data stream,
associated with two modalities, represented by the features
$\Phi = [\phi_1, \dots, \phi_I]$ in $\mathbb{R}^{D \times I}$ and $\Psi = [\psi_1, \dots, \psi_J]$ in $\mathbb{R}^{E \times J}$.
In the context of video to text alignment, $\Phi$ is a description
of the video signal, made up of I temporal intervals, and $\Psi$
is a textual description, composed of J sentences. However,
our model is general and can be applied to other types of
sequential data (biology, speech, music, etc.). In the rest
of the paper, except of course in the experimental section,
we stick to the abstract problem, considering two generic
modalities of a data stream.

Problem statement. Our goal is to assign every element
i in $\{1, \dots, I\}$ to exactly one element j in $\{1, \dots, J\}$. At
the same time, we also want to learn a linear map¹ between
the two feature spaces, parametrized by W in $\mathbb{R}^{E \times D}$. If
the element i is assigned to an element j, we want to find
W such that $\psi_j \approx W\phi_i$. If we encode the assignments in
a binary matrix Y, this can be written in matrix form as:
$\Psi Y \approx W\Phi$ (Fig. 2). The precise definition of the matrix
Y will be provided in Sec. 2. In practice, we insert zero
vectors in between the columns of $\Psi$. This allows some
video frames not to be assigned to any text.

Relation with Bojanowski et al. [7]. Our model is an extension
of [7] with several important improvements. In [7],
instead of aligning video with natural language, the goal
is to align video to symbolic labels in some predefined
dictionary of size K (“open door”, “sit down”, etc.). By
representing the labeling of the video using a matrix Z in
$\{0,1\}^{K \times I}$, the problem solved there corresponds to finding
W and Z such that: $Z \approx W\Phi$. The matrix Z encodes
both data (which labels appear in each clip and in which order)
and the actual temporal assignments. Our parametrization
allows us instead to separate the representation $\Psi$ from the
assignment variable Y. This has several significant advantages:
first, this allows us to consider continuous text representations
as the predicted output $\Psi Y$ in $\mathbb{R}^{E \times I}$ instead of
just classes. As shown in the sequel, this also allows us to
easily impose natural, data-independent constraints on the
assignment matrix Y.

¹ As usual, we actually want an affine map. This can be done by simply
adding a constant row to $\Phi$.

Contributions. This article makes three main contributions:
(i) we extend the model proposed in [ ] in order
to work with continuous representations of text instead
of symbolic classes; (ii) we propose a simple method for
including prior knowledge about the assignment into the
model; and (iii) we demonstrate the performance of the proposed
model on challenging video datasets equipped with
natural language metadata.


2. Proposed model 
2.1. Basic model 

Let us begin by defining the binary assignment matrices
Y in $\{0,1\}^{J \times I}$. The entry $Y_{ji}$ is equal to one if i is assigned
to j and zero otherwise. Since every element i is assigned
to exactly one element j, we have that $Y^T \mathbf{1}_J = \mathbf{1}_I$, where
$\mathbf{1}_k$ represents the vector of ones in dimension k. As in [ ],
we assume that temporal ordering is preserved in the assignment.
Therefore, if the element i is assigned to j, then
i + 1 can only be assigned to j or j + 1. In the following,
we will denote by $\mathcal{Y}$ the set of matrices Y that satisfy this
property. This recursive definition allows us to obtain an efficient
dynamic programming algorithm for minimizing linear
functions over $\mathcal{Y}$, which is a key step of our optimization
method.

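We measure the discrepancy between $\Psi Y$ and $W\Phi$ using
the squared $L_2$ loss. Using an $L_2$ regularizer for the model
W, our learning problem can now be written as:

$$\min_{Y \in \mathcal{Y}} \; \min_{W \in \mathbb{R}^{E \times D}} \; \frac{1}{2I} \|\Psi Y - W\Phi\|_F^2 + \frac{\lambda}{2} \|W\|_F^2. \tag{1}$$

We can rewrite (1) as $\min_{Y \in \mathcal{Y}} q(Y)$, where $q : \mathcal{Y} \to \mathbb{R}$ is
defined for all Y in $\mathcal{Y}$ by:

$$q(Y) = \min_{W \in \mathbb{R}^{E \times D}} \left[ \frac{1}{2I} \|\Psi Y - W\Phi\|_F^2 + \frac{\lambda}{2} \|W\|_F^2 \right]. \tag{2}$$

The minimizer with respect to W is:

$$W^* = \Psi Y \Phi^T \left( \Phi\Phi^T + I\lambda\,\mathrm{Id}_D \right)^{-1}, \tag{3}$$

where $\mathrm{Id}_k$ is the identity matrix in dimension k. Substituting
in (2) yields:

$$q(Y) = \frac{1}{2I} \mathrm{Tr}\left( \Psi Y Q Y^T \Psi^T \right), \tag{4}$$

where Q is a matrix depending on the data and the regularization
parameter $\lambda$:

$$Q = \mathrm{Id}_I - \Phi^T \left( \Phi\Phi^T + I\lambda\,\mathrm{Id}_D \right)^{-1} \Phi. \tag{5}$$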
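The reduction from the ridge regression in W to a quadratic form in Y can be checked numerically. The sketch below is only an illustration on random data: the sizes, the seed and the names `Phi`, `Psi`, `lam`, `objective`, `W_star` are all assumptions made for the example, not part of the original method description. It evaluates the regularized cost at the closed-form solution and compares it with the trace expression.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, I, J = 5, 4, 12, 3      # feature dimensions and sequence lengths (arbitrary)
lam = 0.1                     # regularization parameter lambda (arbitrary)

Phi = rng.standard_normal((D, I))   # first modality, one column per element i
Psi = rng.standard_normal((E, J))   # second modality, one column per element j

# A valid monotone assignment: Y[j, i] = 1 if element i is assigned to j.
Y = np.zeros((J, I))
Y[np.arange(I) * J // I, np.arange(I)] = 1.0

def objective(W):
    """Regularized least-squares cost for a given W (Frobenius norms)."""
    fit = np.linalg.norm(Psi @ Y - W @ Phi) ** 2 / (2 * I)
    reg = lam / 2 * np.linalg.norm(W) ** 2
    return fit + reg

# Closed-form ridge solution W* = Psi Y Phi^T (Phi Phi^T + I*lam*Id_D)^{-1}.
G = Phi @ Phi.T + I * lam * np.eye(D)
W_star = Psi @ Y @ Phi.T @ np.linalg.inv(G)

# Q = Id_I - Phi^T G^{-1} Phi and q(Y) = Tr(Psi Y Q Y^T Psi^T) / (2 I).
Q = np.eye(I) - Phi.T @ np.linalg.inv(G) @ Phi
q = np.trace(Psi @ Y @ Q @ Y.T @ Psi.T) / (2 * I)

print(objective(W_star), q)   # the two values should agree up to rounding
```

Since Q depends only on the features and on the regularization parameter, it can be computed once and reused for every candidate Y.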


Multiple streams. Suppose now that we are given N data
streams (videos in our case), indexed by n in $\{1, \dots, N\}$.
The approach proposed so far is easily generalized to this
case by taking $\Psi$ and $\Phi$ to be the horizontal concatenation
of all the matrices $\Psi_n$ and $\Phi_n$. The matrices Y in $\mathcal{Y}$ are
block-diagonal in this case, the diagonal blocks being the
assignment matrices of every stream:

$$Y = \begin{bmatrix} Y_1 & & 0 \\ & \ddots & \\ 0 & & Y_N \end{bmatrix}.$$


This is the model actually used in our implementation. 

2.2. Priors and constraints 

We can incorporate task-specific knowledge in our
model by adding constraints on the matrix Y, for example
to model event duration. Constraints on Y can also be used to
avoid the degenerate solutions known to plague discriminative
clustering [2, 7, 16, 20].

Duration priors. The model presented so far is solely
based on a discriminative function. Our formulation in
terms of an assignment variable Y allows us to reason about
the number of elements i that get assigned to the element
j. For videos, since each element i corresponds to a fixed
time interval, this number is the duration of text element j.
More formally, the duration $\delta(j)$ of element j is obtained as:
$\delta(j) = e_j^T Y \mathbf{1}_I$, where $e_j$ is the j-th vector of the canonical
basis of $\mathbb{R}^J$. Assuming for simplicity a single target duration
$\mu$ and variance parameter $\sigma$ for all units, this leads to
the following duration penalty:

$$r(Y) = \frac{1}{2\sigma^2} \|Y\mathbf{1}_I - \mu\mathbf{1}_J\|_2^2. \tag{6}$$
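As a small illustration of the duration penalty, the sketch below computes the durations as row sums of Y and penalizes their squared deviation from a common target. It assumes the penalty is the sum over j of (delta(j) - mu)^2, divided by 2 sigma^2; the function name and the two toy matrices are made up for the example.

```python
import numpy as np

def duration_penalty(Y, mu, sigma):
    """Duration prior for a J x I assignment matrix Y (assumed quadratic form)."""
    durations = Y.sum(axis=1)          # delta(j) = e_j^T Y 1_I for every j
    return np.sum((durations - mu) ** 2) / (2 * sigma ** 2)

# Toy example: I = 6 elements assigned to J = 3 elements.
Y_balanced = np.array([[1, 1, 0, 0, 0, 0],
                       [0, 0, 1, 1, 0, 0],
                       [0, 0, 0, 0, 1, 1]], dtype=float)
Y_skewed = np.array([[1, 1, 1, 1, 0, 0],
                     [0, 0, 0, 0, 1, 0],
                     [0, 0, 0, 0, 0, 1]], dtype=float)

print(duration_penalty(Y_balanced, mu=2.0, sigma=1.0))
print(duration_penalty(Y_skewed, mu=2.0, sigma=1.0))
```

The skewed assignment, where the first element absorbs most of the sequence, receives the larger penalty; a large sigma flattens the penalty and effectively switches the prior off.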


Path priors. Some elements of $\mathcal{Y}$ correspond to very unlikely
assignments. In speech processing and various related
tasks [34], the warping paths are often constrained,
forcing for example the path to fall in the Sakoe-Chiba band
or in the Itakura parallelogram [40]. Such constraints allow


For a fixed Y, the minimization with respect to W in (2) is
a ridge regression problem. It can be solved in closed form.







(a) A (near) degenerate solution. 



Figure 3: (a) depicts a typical near degenerate solution where almost
all the elements i are assigned to the first element, close
to the constant vector element of the kernel of Q. (b) We propose
to avoid such solutions by forcing the alignment to stay outside
of a given region (shown in yellow), which may be a band or a
parallelogram. The dark blue entries correspond to the assignment
matrix Y, and the yellow ones represent the constraint set. See
text for more details. (Best seen in color.)


us to encode task-specific assumptions and to avoid degenerate
solutions associated with the fact that constant vectors
belong to the kernel of Q (Fig. 3 (a)). Band constraints,
as illustrated in Fig. 3 (b), successfully exclude the kind of
degenerate solutions presented in (a). Let us denote by $Y_c$
the band-diagonal matrix of width $\beta$, such that the diagonal
entries are 0 and the others are 1; such a matrix is illustrated
in Fig. 3 (b) in yellow. In order to ensure that the
assignment does not deviate too much from the diagonal,
we can impose that at most C non zero entries of Y are outside
the band. We can formulate that constraint as follows:
$\mathrm{Tr}(Y_c^T Y) \le C$.

This constraint could be added to the definition of the set
$\mathcal{Y}$, but this would prohibit the use of dynamic programming,
which is a key step of our optimization algorithm described
in Sec. 3. We instead propose to add a penalization term to
our cost function, corresponding to the Lagrange multiplier
for this constraint. Indeed, for any value of C, there exists
an $\alpha$ such that if we add

$$l(Y) = \alpha\, \mathrm{Tr}(Y_c^T Y), \tag{7}$$

to our cost function, the two solutions are equal, and thus
the constraint is satisfied. In practice, we select the value of
$\alpha$ by doing a grid search on a validation set.
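The band penalty can be sketched as follows. The text does not spell out how the band width is measured, so this example assumes a relative half-width `beta` around the normalized diagonal; that convention, the function names and the toy matrices are assumptions made for illustration only.

```python
import numpy as np

def band_mask(J, I, beta):
    """Mask with 0 inside a band around the diagonal and 1 outside (assumed convention)."""
    rows = (np.arange(J)[:, None] + 0.5) / J   # normalized position of each element j
    cols = (np.arange(I)[None, :] + 0.5) / I   # normalized position of each element i
    return (np.abs(rows - cols) > beta).astype(float)

def band_penalty(Y, Y_c, alpha):
    """alpha * Tr(Y_c^T Y): alpha times the number of assignments outside the band."""
    return alpha * np.sum(Y_c * Y)

Y_c = band_mask(J=3, I=6, beta=0.2)
Y_diag = np.array([[1, 1, 0, 0, 0, 0],
                   [0, 0, 1, 1, 0, 0],
                   [0, 0, 0, 0, 1, 1]], dtype=float)
Y_skewed = np.array([[1, 1, 1, 1, 0, 0],
                     [0, 0, 0, 0, 1, 0],
                     [0, 0, 0, 0, 0, 1]], dtype=float)

print(band_penalty(Y_diag, Y_c, alpha=2.0))
print(band_penalty(Y_skewed, Y_c, alpha=2.0))
```

Because the penalty is linear in Y, it simply adds `alpha * Y_c` to the gradient and does not interfere with dynamic programming.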

2.3. Full problem formulation 

Including the constraints defined in Sec. 2.2 into our objective
function yields the following optimization problem:

$$\min_{Y \in \mathcal{Y}} \; q(Y) + r(Y) + l(Y), \tag{8}$$

where q, r and l are the three functions respectively defined
in (4), (6) and (7).


3. Optimization 

3.1. Continuous relaxation 

The discrete optimization problem formulated in Eq. (8)
is the minimization of a positive semi-definite quadratic
function over a very large set $\mathcal{Y}$, composed of binary assignment
matrices. Following [7], we relax this problem
by minimizing our objective function over the (continuous)
convex hull $\overline{\mathcal{Y}}$ instead of $\mathcal{Y}$. Although it is possible to describe
$\overline{\mathcal{Y}}$ in terms of linear inequalities, we never use this
formulation in the following, since the use of a general linear
programming solver does not exploit the structure of the
problem. Instead, we consider the relaxed problem:

$$\min_{Y \in \overline{\mathcal{Y}}} \; q(Y) + r(Y) + l(Y) \tag{9}$$

as the minimization of a convex quadratic function over an
implicitly defined convex and compact domain. This type
of problem can be solved efficiently using the Frank-Wolfe
algorithm [7, 11] as soon as it is possible to minimize linear
forms over the convex compact domain.
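For reference, one Frank-Wolfe iteration on the relaxed problem can be written out as below. This is the textbook form of the algorithm applied to f = q + r + l, not a transcription of the authors' implementation. It assumes the duration penalty has the quadratic form $\frac{1}{2\sigma^2}\|Y\mathbf{1}_I - \mu\mathbf{1}_J\|_2^2$, and the step size $\gamma_t = 2/(t+2)$ is a standard default (an exact line search is also possible since f is quadratic). The gradient uses the symmetry of Q.

$$\nabla f(Y_t) = \frac{1}{I}\,\Psi^T \Psi\, Y_t\, Q \;+\; \frac{1}{\sigma^2}\left(Y_t \mathbf{1}_I - \mu \mathbf{1}_J\right)\mathbf{1}_I^T \;+\; \alpha\, Y_c,$$

$$S_t \in \arg\min_{S \in \mathcal{Y}} \mathrm{Tr}\left(\nabla f(Y_t)^T S\right), \qquad Y_{t+1} = Y_t + \gamma_t\,(S_t - Y_t), \qquad \gamma_t = \frac{2}{t+2}.$$

Each iterate is a convex combination of elements of $\mathcal{Y}$, so it stays in the convex hull without any projection, and the only operation required on the domain is the linear minimization that produces $S_t$.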

First, note that $\overline{\mathcal{Y}}$ is the convex hull of $\mathcal{Y}$, and
the solution to $\min_{Y \in \mathcal{Y}} \mathrm{Tr}(AY)$ is also a solution of
$\min_{Y \in \overline{\mathcal{Y}}} \mathrm{Tr}(AY)$ [5]. As noted in [7], it is possible to minimize
any linear form $\mathrm{Tr}(AY)$, where A is an arbitrary matrix,
over $\mathcal{Y}$ using dynamic programming in two steps: First,
we build the cumulative cost matrix D whose entry (i, j)
is the cost of the optimal alignment starting in (1,1) and
terminating in (i, j). This step can be done recursively in
O(IJ) steps. Second, we recover the optimal Y by backtracking
in the matrix D. See [7] for details.
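The two steps above can be sketched in a few lines. The code below is a minimal illustration rather than the implementation of [7]: the function name is made up, and it assumes that the path starts at (1,1), ends at (I,J), and that I is at least J.

```python
import numpy as np

def min_linear_form(A):
    """Minimize Tr(A Y) over monotone assignments Y (J x I), for A of shape I x J.

    Assumption: element 1 is assigned to row 1 and element I to row J.
    """
    I, J = A.shape
    assert I >= J, "need at least as many elements i as elements j"
    # Step 1: cumulative cost matrix, D[i, j] = best cost of a path from (0, 0) to (i, j).
    D = np.full((I, J), np.inf)
    D[0, 0] = A[0, 0]
    for i in range(1, I):
        for j in range(min(i + 1, J)):
            stay = D[i - 1, j]                            # i-1 was also assigned to j
            move = D[i - 1, j - 1] if j > 0 else np.inf   # i-1 was assigned to j-1
            D[i, j] = A[i, j] + min(stay, move)
    # Step 2: backtrack from (I-1, J-1) to recover the optimal assignment.
    Y = np.zeros((J, I))
    j = J - 1
    for i in range(I - 1, -1, -1):
        Y[j, i] = 1.0
        if i > 0 and j > 0 and D[i - 1, j - 1] <= D[i - 1, j]:
            j -= 1
    return Y, D[I - 1, J - 1]

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
Y_opt, value = min_linear_form(A)
print(Y_opt)
print(value)
```

The double loop visits each entry of D once, which gives the O(IJ) cost mentioned above.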

3.2. Rounding 

Solving (9) provides a continuous solution $Y^*$ in $\overline{\mathcal{Y}}$ and
a corresponding optimal linear map $W^*$. Our original problem
is defined on $\mathcal{Y}$, and we thus need to round $Y^*$. We propose
three rounding procedures, two of them corresponding
to Euclidean norm minimization and a third one using the
map $W^*$. All three roundings boil down to solving a linear
problem over $\mathcal{Y}$, which can be done once again using
dynamic programming. Since there is no principled, analytical
way to pick one of these procedures over the others,
we conduct an empirical evaluation in Sec. 5 to assess their
strengths and weaknesses.

Rounding in $\mathcal{Y}$. The simplest way to round $Y^*$ is to find
the closest point Y according to the Euclidean distance in
the space $\mathcal{Y}$: $\min_{Y \in \mathcal{Y}} \|Y - Y^*\|_F^2$. This problem can be
reduced to a linear program over $\mathcal{Y}$.
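The reduction to a linear problem follows from a short expansion, given here as a worked step. It only uses the fact that every Y in $\mathcal{Y}$ has exactly one non-zero entry per column, so that $\|Y\|_F^2 = I$ is constant over $\mathcal{Y}$:

$$\|Y - Y^*\|_F^2 = \|Y\|_F^2 - 2\,\mathrm{Tr}\left(Y^{*T} Y\right) + \|Y^*\|_F^2 = I - 2\,\mathrm{Tr}\left(Y^{*T} Y\right) + \|Y^*\|_F^2,$$

$$\arg\min_{Y \in \mathcal{Y}} \|Y - Y^*\|_F^2 = \arg\min_{Y \in \mathcal{Y}} \mathrm{Tr}(AY) \quad \text{with} \quad A = -2\,Y^{*T}.$$

The rounded solution is therefore the monotone path that collects the largest total mass of $Y^*$.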

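Rounding in $\Psi Y$. This is in fact the space where the original
least-squares minimization is formulated. We solve
in this case the problem $\min_{Y \in \mathcal{Y}} \|\Psi(Y - Y^*)\|_F^2$, which
weighs the error measure using the features $\Psi$. A simple calculation
shows that the previous problem is equivalent to:

$$\min_{Y \in \mathcal{Y}} \mathrm{Tr}\left( Y^T \left( \mathrm{Diag}(\Psi^T \Psi)\, \mathbf{1}_I^T - 2\, \Psi^T \Psi\, Y^* \right) \right), \tag{10}$$

where $\mathrm{Diag}(\Psi^T \Psi)$ denotes the vector of diagonal entries of $\Psi^T \Psi$.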
(a) Y fixed to ground truth. (b) Corresponding constraints.

Figure 4: Two ways of incorporating supervision. (a) The assignments
are fixed to the ground truth: the dark blue entries exactly
correspond to $Y_s$, and yellow entries are forbidden assignments;
(b) the assignments are constrained. For even rows, assignments
must be outside the yellow strips. Light blue regions correspond
to authorized paths for the assignment.


Rounding in W. Our optimization procedure gives us two
outputs, namely a relaxed assignment $Y^* \in \overline{\mathcal{Y}}$ and a model
$W^*$ in $\mathbb{R}^{E \times D}$. We can use this model to predict an alignment
Y in $\mathcal{Y}$ by solving the following quadratic optimization
problem: $\min_{Y \in \mathcal{Y}} \|\Psi Y - W^*\Phi\|_F^2$. As before, this is
equivalent to a linear program. An important feature of this
rounding procedure is that it can also be used on previously
unseen data.
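To see why this rounding is again a linear problem, note that for a binary Y with one non-zero entry per column the squared error decomposes over the elements i. The sketch below builds the corresponding cost matrix; the function name and the random data are illustrative assumptions, and the resulting matrix would then be passed to a dynamic programming solver.

```python
import numpy as np

def rounding_cost(Psi, Phi, W):
    """Cost matrix C (I x J) with C[i, j] = ||psi_j - W phi_i||^2."""
    pred = W @ Phi                                   # E x I, predictions W phi_i
    diff = pred.T[:, None, :] - Psi.T[None, :, :]    # I x J x E
    return np.sum(diff ** 2, axis=2)

rng = np.random.default_rng(0)
Psi = rng.standard_normal((3, 4))    # E = 3, J = 4
Phi = rng.standard_normal((5, 8))    # D = 5, I = 8
W = rng.standard_normal((3, 5))      # stands in for the learned map W*
C = rounding_cost(Psi, Phi, W)

# For any valid assignment Y, ||Psi Y - W Phi||_F^2 equals Tr(C Y).
Y = np.zeros((4, 8))
Y[np.arange(8) * 4 // 8, np.arange(8)] = 1.0
print(np.trace(C @ Y), np.linalg.norm(Psi @ Y - W @ Phi) ** 2)
```

Only W and the features are needed to build C, which is why this rounding also applies to data not seen during optimization.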


4. Semi-supervised setting 

The proposed model is well suited to semi-supervised
learning. Incorporating additional supervision just consists
in constraining parts of the matrix Y. Let us assume that
we are given a triplet $(\Psi_s, \Phi_s, Y_s)$ representing supervisory
data. The part of data that is not involved in that supervision
is denoted by $(\Psi_u, \Phi_u, Y_u)$. Using the additional data
amounts to solving (8) with matrices $(\Psi, \Phi, Y)$ defined as:

$$\Psi = [\Psi_u, \kappa \Psi_s], \qquad \Phi = [\Phi_u, \kappa \Phi_s], \qquad Y = \begin{bmatrix} Y_u & 0 \\ 0 & Y_s \end{bmatrix}. \tag{11}$$

The parameter $\kappa$ allows us to properly weigh the supervised
and unsupervised examples. Scaling the features this way
corresponds to using the following loss:

$$\|\Psi_u Y_u - W\Phi_u\|_F^2 + \kappa^2 \|\Psi_s Y_s - W\Phi_s\|_F^2. \tag{12}$$

Since $Y_s$ is given, we can optimize over $\mathcal{Y}$ while constraining
the lower right block of Y. In our implementation this
means that we fix the lower-right entries in Y to the
ground-truth values during optimization.
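The effect of scaling the supervised features can be verified on toy data. The following sketch is an illustration under assumed sizes and random values (the helper `toy_stream` and all dimensions are made up): it stacks the unsupervised and supervised parts and checks that the stacked squared error equals the unsupervised error plus the squared scaling factor times the supervised error.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, kappa = 4, 3, 2.0

def toy_stream(I, J):
    """Random features and a uniform monotone assignment for one stream."""
    Phi = rng.standard_normal((D, I))
    Psi = rng.standard_normal((E, J))
    Y = np.zeros((J, I))
    Y[np.arange(I) * J // I, np.arange(I)] = 1.0
    return Psi, Phi, Y

Psi_u, Phi_u, Y_u = toy_stream(I=6, J=3)   # unsupervised part
Psi_s, Phi_s, Y_s = toy_stream(I=4, J=2)   # supervised part (Y_s is known)

# Stacked matrices: scaled horizontal concatenation and block-diagonal Y.
Psi = np.hstack([Psi_u, kappa * Psi_s])
Phi = np.hstack([Phi_u, kappa * Phi_s])
Y = np.block([[Y_u, np.zeros((3, 4))],
              [np.zeros((2, 6)), Y_s]])

W = rng.standard_normal((E, D))
stacked = np.linalg.norm(Psi @ Y - W @ Phi) ** 2
weighted = (np.linalg.norm(Psi_u @ Y_u - W @ Phi_u) ** 2
            + kappa ** 2 * np.linalg.norm(Psi_s @ Y_s - W @ Phi_s) ** 2)
print(stacked, weighted)
```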

Manual annotations of videos are sometimes imprecise,
and we thus propose to include them in a softer manner. As
mentioned in Sec. 2, odd columns in $\Psi$ are filled with zeros.
This allows some video frames not to be assigned to any
text. Instead of imposing that the assignment Y coincides
with the annotations, we constrain it to lie within annotated
intervals. For any even (non null) element j, we force the
set of video frames that are assigned to j to be a subset of
those in the ground truth (Fig. 4). That way, we allow the assignment
to pick the most discriminative parts of the video
within the annotated interval. This way of incorporating supervision
empirically yields much better performance.

5. Experimental evaluation 

We evaluate the proposed approach on two challenging
datasets. We first compare it to a recent method on the associated
dataset [7]. We then run experiments on TACoS,
a video dataset composed of cooking activities with textual
annotations [37]. We select the hyperparameters $\lambda$, $\alpha$, $\sigma$, $\kappa$
on a validation set. All results are reported with standard
error over several random splits.

Performance measure. All experiments are evaluated using
the Jaccard measure in [7], which quantifies the difference
between a ground-truth assignment $Y_{gt}$ and the predicted Y
by computing the precision for each row. In particular the
best performance of 1 is obtained if the predicted assignment
is within the ground-truth. If the prediction is outside,
it is equal to 0.
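One possible reading of this measure is sketched below. The exact definition is given in [7]; here we only assume that, for each row j, the score is the fraction of predicted frames that fall inside the ground-truth interval, averaged over rows, and that rows without any predicted frame are skipped. The function name and both conventions are assumptions for illustration.

```python
import numpy as np

def rowwise_precision(Y_pred, Y_gt):
    """Mean over rows j of |pred_j ∩ gt_j| / |pred_j| (assumed reading of the measure)."""
    inter = np.sum(Y_pred * Y_gt, axis=1)    # frames predicted for j that are correct
    n_pred = np.sum(Y_pred, axis=1)          # frames predicted for j
    keep = n_pred > 0                        # assumption: skip rows with no prediction
    return float(np.mean(inter[keep] / n_pred[keep]))

Y_gt = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]], dtype=float)
Y_inside = np.array([[0, 1, 0, 0, 0, 0],
                     [0, 0, 0, 0, 1, 1]], dtype=float)
Y_outside = np.array([[0, 0, 0, 1, 1, 0],
                      [1, 1, 0, 0, 0, 0]], dtype=float)

print(rowwise_precision(Y_inside, Y_gt))
print(rowwise_precision(Y_outside, Y_gt))
```

Under this reading, a prediction that selects only part of each annotated interval still obtains the best score, which matches the description above.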

5.1. Comparison with Bojanowski et al. [7] 

Our model is a generalization of Bojanowski et al. [7].
Indeed, we can easily cast the problem formulated in that
paper into our framework. Our model differs from the
aforementioned one in three crucial ways: First, we do not
need to add a separate “background class”, which is always
problematic. Second, we propose another way to handle the
semi-supervised setting. Most importantly, we replace the
matrix Z by $\Psi Y$, allowing us to add data-independent constraints
and priors on Y. In this section we describe comparative
experiments conducted on the dataset proposed in [7].

Dataset. We use the videos, labels and features provided
in [7]. This data is composed of 843 videos (94 videos are
set aside for a classification experiment) that are annotated
with a sequence of labels. There are 16 different labels such
as “Eat”, “Open Door” and “Stand Up”. As in the original
paper, we randomly split the dataset into ten different
validation, evaluation and supervised sets.

Features. The label sequences provided as weak supervisory
signal in [ ] can be used as our features $\Psi$. We consider
a language composed of sixteen words, where every word
corresponds to a label. Then, the representation $\psi_j$ of every
element j is the indicator vector of the j-th label in the sequence.
Since we do not model background, we simply interleave
zero vectors in between meaningful elements. The
matrix $\Phi$ corresponds to the video features provided with
the paper’s code. These features are 2000-dimensional
bag-of-words vectors computed on the HOF channel.

Baselines. As our baseline, we run the code from [7] that
is available online² for different fractions of annotated data,
seeds and parameters. As a sanity check, we compare the
performance of our algorithm to that of a random assignment
that follows the priors. This random baseline obtains
a performance measure of 32.8 with a standard error of 0.3.

² https://github.com/piotr-bojanowski/action-ordering

Figure 5: Comparing our approach with the various rounding
schemes to the model in [ ] on the same data, using the same evaluation
metric as in [ ]. See text for details.

Results. We plot performance versus amount of supervised
data in Fig. 5. We use the same evaluation metric as in [ ].
First of all, when no supervision is available, our method
works significantly better (no overlap between error bars).
This can be due (1) to the fact that we do not model background
as a separate class; and (2) to the use of the priors
described in Sec. 2.2. As additional supervisory data becomes
available, we observe a consistent improvement of
more than 5% over [7] for the $\Psi Y$ and W roundings. The
Y rounding does not give good results in general.

The most interesting point is that the drop at the
beginning of the curve in [ ] does not occur in our case.
When no supervised data is available, the optimal $Y^*$ solely
depends on the video features $\Phi$. When the fraction of annotated
data increases, the optimal $Y^*$ changes and depends
on the annotations. However, the temporal extent of an action
is not well defined. Therefore, manual annotations need
not be coherent with the $Y^*$ obtained with no supervision.
Our way of dealing with supervised data is less coercive and
does not commit strictly to the annotated data.

In Fig. 5 we have observed that the best performing
rounding on this task is the one using the matrix product
$\Psi Y$. It is important to notice that [ ] performs rounding
on a matrix $Z = \Psi Y$ which is thus equivalent to the best
performing rounding for our method. In preliminary experiments,
we observed that using a W rounding for [ ] does
not significantly improve performance.

5.2. Results on the TACoS dataset 

We also evaluate our method on the TACoS dataset [37]
which includes actual natural language sentences. On this
dataset, we use the W rounding as it is the one that empirically
gives the best test performance. We do not yet have a
compelling explanation as to why this is the case.


Dataset. TACoS is composed of 127 videos picturing people
who perform cooking tasks. Every video is associated
with two kinds of annotations. The first one is composed
of low-level activity labels with precise temporal location.
We do not make use of these fine-grained annotations in
this work. The second one is a set of natural language descriptions
that were obtained by crowd-sourcing. Annotators
were asked to describe the content of the video using
simple sentences. Each video $\Phi$ is associated with K textual
descriptions $[\Psi^1, \dots, \Psi^K]$. Every textual description
is composed of multiple sentences with associated temporal
extent. We consider as data points the pairs $(\Psi^k, \Phi)$ for k
in $\{1, \dots, K\}$.

Video features. We build the feature matrix $\Phi$ by computing
dense trajectories [46] on all videos. We compute
dictionaries of 500 words for HOG, HOF and MBH
channels. These experimentally provide satisfactory performance
while staying relatively low-dimensional. For a
given temporal window, we concatenate bag-of-words representations
for the four channels. As in the Hellinger kernel,
we use the square root of $L_1$-normalized histograms as
our final features. We use overlapping temporal windows of
length 150 frames with a stride of 50.
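The last two steps can be illustrated with a short sketch. The function names and the toy histogram are assumptions; in particular, the handling of the end of the video (only windows that fit entirely are kept) is a choice made for the example and is not specified in the text.

```python
import numpy as np

def window_starts(n_frames, length=150, stride=50):
    """Start frames of overlapping windows that fit entirely in the video (assumed policy)."""
    return list(range(0, n_frames - length + 1, stride))

def hellinger_features(hist, eps=1e-12):
    """Square root of the L1-normalized histogram."""
    hist = np.asarray(hist, dtype=float)
    return np.sqrt(hist / max(hist.sum(), eps))

starts = window_starts(n_frames=300)
feat = hellinger_features([1, 3, 0, 12])   # toy bag-of-words counts
print(starts)
print(feat, np.sum(feat ** 2))
```

After this transformation the feature vector has unit Euclidean norm, so a linear model on these features behaves like a Hellinger kernel on the original histograms.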

Text features. To apply our method to textual data, we need
a feature representation for each sentence. In our experiments,
we explore multiple ways to represent sentences and
empirically compare their performance (Table 1). We discuss
two ways to encode sentences into vector representations,
one based on bag of words, the other on continuous
word embeddings [32].

To build our bag-of-words representation, we construct
a dictionary using all sentences in the TACoS dataset. We
run a part-of-speech tagger and a dependency parser [30] in
order to exploit the grammatical structure. These features
are pooled using three different schemes. (1) ROOT: In this
setup, we simply encode each sentence by its root verb as
provided by the dependency parser. (2) ROOT+DOBJ: In
this setup we encode a sentence by its root verb and its direct
object dependency. This representation makes sense on
the TACoS dataset as sentences are in general pretty simple.
For example, the sentence “The man slices the cucumber”
is represented by “slice” and “cucumber”. (3) VNA: This
representation is the closest to the usual bag-of-words text
representation. We simply pool all the tokens whose part of
speech is Verb, Noun or Adjective. The first two representations
are very rudimentary versions of bags of words. They
typically contain only one or two non zero elements.
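As a toy illustration of the ROOT+DOBJ encoding, the sketch below turns already extracted lemmas into a bag-of-words vector. The four-word vocabulary and the function name are hypothetical, and the parsing step itself (root verb and direct object extraction) is assumed to have been done beforehand.

```python
import numpy as np

def bag_of_words(tokens, vocabulary):
    """Count vector of the given tokens over a fixed vocabulary."""
    index = {word: k for k, word in enumerate(vocabulary)}
    vec = np.zeros(len(vocabulary))
    for token in tokens:
        if token in index:          # tokens outside the dictionary are ignored
            vec[index[token]] += 1.0
    return vec

# Hypothetical dictionary; the real one is built from all TACoS sentences.
vocabulary = ["cucumber", "knife", "slice", "wash"]

# "The man slices the cucumber" -> root verb "slice", direct object "cucumber".
psi = bag_of_words(["slice", "cucumber"], vocabulary)
print(psi)
```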

We also explore the use of word embeddings
(W2V) [32], trained on three different corpora. First,
we train them on the TACoS corpus. Even though the
amount of data is very small (175,617 words), the vocabulary
is also limited and the sentences are simple. Second,
we train the vector representations on a dataset of 50,993
kitchen recipes, downloaded from allrecipes.com. This corresponds
to a corpus of roughly 5 million tokens. However,
the sentences are written in imperative mode, which differs
from the sentences found in TACoS. For completeness, we
also use the WaCky corpus [4], a large web-crawled dataset
of news articles. We train representations of dimension
100 and 200. A sentence is then represented by the
concatenation of the vector representations of its root verb
and its root’s direct object.

Figure 6: Evaluation of the priors we propose in this paper. (a) We plot the performance of our model for various values of $\sigma$. When $\sigma$ is
big, the prior has no effect. We see that there is a clear trade-off and an optimal choice of $\sigma$ yields better performance. (b) Performance as
a function of $\alpha$ and width of the band. The shown surface is interpolated to ease readability. (c) Performance for various values of $\alpha$. This
plot corresponds to the slice illustrated in (b) by the black plane.

Baselines. On this dataset, we considered two baselines.
The first one is Bojanowski et al. [7] using the ROOT textual
features. Verbs are used in place of labels by the
method. The second one, which we call Diagonal, corresponds
to the performance obtained by the uniform alignment,
i.e. assigning the same amount of video elements i to
each textual element j.
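A uniform alignment of this kind can be built in one line. The sketch below is an illustration only: the function name is made up, and when I is not a multiple of J the remainder has to go somewhere, so the rounding convention used here (integer division) is an assumption.

```python
import numpy as np

def diagonal_alignment(I, J):
    """J x I assignment matrix giving each element j a (near) equal share of the I elements."""
    Y = np.zeros((J, I))
    Y[np.arange(I) * J // I, np.arange(I)] = 1.0
    return Y

Y_diag = diagonal_alignment(I=7, J=3)
print(Y_diag)
print(Y_diag.sum(axis=1))   # number of video elements per textual element
```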

Evaluation of the priors. We proposed in Sec. 2.2 two
heuristics for including prior knowledge and avoiding degenerate
solutions to our problem. In this section, we evaluate
the performance of these priors on TACoS. To this end,
we run our method with the two different models separately.
We perform this experiment using the ROOT+DOBJ text
representation. The results of this experiment are illustrated
in Fig. 6.

We see that both priors are useful. The duration prior, 
when a is carefully chosen, allows us to improve perfor¬ 
mance from 0.441 (infinite a) to 0.475. There is a clear 


text representation Dim. 100 Dim. 200 

W2V UKWAC 43.8 (1.5) 46.4 (0.7) 

W2V TACoS 48.3 (0.4) 48.2 (0.4) 

W2V ALLRECIPE 43.3 (0.7) 44.7 (0.5) 


trade-off in this parameter. Using a bit of duration prior 
helps us to get a meaningful V* by discarding degenerate 
solutions. However, when the prior is too strong, we obtain 
a degenerate solution with decreased performance. 

The band prior (as depicted in Eig. 6, b and c) improves 
performance even more. We plot in (b) the performance as 
a joint function of the parameter a and of the width of the 
band /S. We see that the width that provides the best perfor¬ 
mance is 0.1. We plot in (c) the corresponding performance 
as a function of a. Using large values of a corresponds to 
constraining the path to be entirely inside the band, which 
explain why the performance flattens for large a. When us¬ 
ing a small width, the best path is not entirely inside the 
band and one has to carefully choose the parameter a. 

We show in Fig. 6 the performance of our method for
various values of the parameters on the evaluation set.
Please note, however, that when used in other experiments,
the actual values of these parameters are chosen on the
validation set only. Sample qualitative results are shown in
Fig. 7.

Evaluation of the text representations. In Table 1, we 
compare the continuous word representations trained on 
various text corpora. The representation trained on TACoS 
works best. It is usually advised to retrain the representation 


text representation      nosup         semisup
Diagonal                 35.2 (3.7)
Bojanowski et al. [7]    39.0 (1.0)    49.1 (0.7)
ROOT                     49.9 (0.2)    59.2 (1.0)
ROOT+DOBJ                48.7 (0.9)    65.4 (1.0)
VNA                      45.7 (1.4)    59.9 (2.9)
W2V TACoS 100            48.3 (0.4)    60.2 (1.5)

Table 2: Performance when no supervision is used to compute the
assignment (nosup) and when half of the dataset is provided with
time stamped sentences (semisup).

Table 1: Comparison of text representations trained on different
corpora, in dimension 100 and 200.






















The person gets out the beans, a knife, a cutting board, a plastic container, and a mixing bowl.
The person washes the beans.
The person removes both ends of each bean and puts the ends in the plastic container.
The person cuts up each bean into small pieces and puts them into the mixing bowl.
The person throws away the ends of the beans,
The person washes the dishes,

The person sets up chopping board, knife and stainless steel bowl.
The person takes the broad beans out of the fridge and places in bowl while checking for freshness.
The person vigorously washes the beans under running water.
The person chops of both ends of each bean.
The person chops the beans into 1/4 to 1/2 inch segments.
The person places chopped beans on the freshly rinsed plate.

He grabs an orange.
He gets a knife and cutting board.
He cuts the orange in half.
He juices the orange.
He pours the juice into a glass.
He adds sugar to the glass and stirs.

Figure 7: Representative qualitative results for our method applied on TACoS. Correctly assigned frames are in green, incorrect ones in red.


on a text corpus that has a similar distribution to the corpus
of interest. Moreover, higher-dimensional representations
(200) do not help, probably because of the limited vocabulary
size. The representations trained on a very large news
corpus (UKWAC) benefit from using higher-dimensional
vectors. With such a big corpus, the representations of the
cooking lexical field are probably merged together. This
is further supported by the fact that using embeddings
trained on Google News provided weak performance (42.1).

In Table 2, we experimentally compare our approach
to the baselines, in an unsupervised setting and a
semi-supervised one. First, we observe that the diagonal baseline
has reasonable performance. Note that this diagonal assignment
is different from a random one, since a uniform assignment
between text and video makes some sense in our context.
Second, we compare to the method of [7] on ROOT,
which is the only setup where this method can be used.
This baseline is higher than the diagonal one but well below
the performance of our model with the same ROOT representation.

Using bag-of-words representations, we notice that simple
pooling schemes work best. The best performing representation
is purely based on verbs. This is probably due
to the fact that richer representations can mislead such a
weakly supervised method. As additional supervision becomes
available, the ROOT+DOBJ pooling works much
better than only using ROOT, validating the previous claim.
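
The pooling schemes compared above can be illustrated with a small sketch. We assume here that each sentence comes with a dependency parse given as (word, relation) pairs, and that ROOT keeps only the word attached to the root relation while ROOT+DOBJ also keeps the direct object. The function name, the lowercase relation labels and the toy vocabulary are hypothetical, and lemmatization is ignored.

```python
def pooled_bag_of_words(parsed_sentence, kept_relations, vocabulary):
    """Bag-of-words vector restricted to selected dependency relations.

    parsed_sentence: list of (word, relation) pairs, assumed to come
                     from an external dependency parser.
    kept_relations:  set of relation labels to keep, e.g. {"root"}.
    vocabulary:      list of words defining the vector dimensions.
    """
    index = {word: k for k, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for word, relation in parsed_sentence:
        if relation in kept_relations and word in index:
            vector[index[word]] += 1
    return vector


# Hand-written parse of "He cuts the orange in half." (illustrative only).
example_parse = [("he", "nsubj"), ("cuts", "root"), ("the", "det"),
                 ("orange", "dobj"), ("in", "prep"), ("half", "pobj")]
example_vocabulary = ["cuts", "orange", "juices", "glass"]
root_only = pooled_bag_of_words(example_parse, {"root"}, example_vocabulary)
root_dobj = pooled_bag_of_words(example_parse, {"root", "dobj"},
                                example_vocabulary)
```

In this toy example the ROOT vector only marks the verb, whereas ROOT+DOBJ additionally marks the object, which gives a richer but potentially noisier description of the sentence.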

6. Discussion

We presented in this paper a method able to align a video
with its natural language description. We would like to extend
our work to even more challenging scenarios including
feature movies and more complicated grammatical structures.
Also, our use of natural language processing tools
is limited, and we plan to incorporate better grammatical
reasoning in future work.